Software requirements

External software

Today I’ll be using JSONView, a browser extension that renders JSON output nicely in Chrome and Firefox. (Not needed, but recommended.)

R packages

  • New: jsonlite, listviewer
  • Already used: tidyverse, lubridate, hrbrthemes

Install these new packages now:
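A minimal sketch of that install step:

```r
## Install the new packages (this only needs to be done once)
install.packages(c("jsonlite", "listviewer"))
```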

We might as well load the tidyverse now, since we’ll be using that a fair bit anyway. I’ll also set my preferred ggplot2 theme for the rest of this document.
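Something like the following. (The specific theme_ipsum() choice below is illustrative — substitute your own preferred theme.)

```r
library(tidyverse)
library(hrbrthemes)  ## for extra ggplot2 themes

## Set the ggplot2 theme for the rest of this document
theme_set(theme_ipsum())
```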

Recap from last time

During the last lecture, we saw that websites and web applications fall into two categories: 1) Server-side and 2) Client-side. We then practised scraping data that falls into the first category — i.e. rendered server-side — using the rvest package. This technique focuses on CSS selectors (with help from SelectorGadget) and HTML tags. We also saw that webscraping often involves as much art as science. The plethora of CSS options and the flexibility of HTML itself mean that steps which work perfectly well on one website can easily fail on another.

Today we focus on the second category: Scraping web data that is rendered client-side. The good news is that, when available, this approach typically makes it much easier to scrape data from the web. The downside is that, again, it can involve as much art as it does science. Moreover, as I emphasised last time, just because we can scrape data, doesn’t mean that we should (i.e. ethical, legal and other considerations). These admonishments aside, let’s proceed…

Client-side, APIs, and API endpoints

Recall that websites or applications that are built using a client-side framework typically involve something like the following steps:

  • You visit a URL that contains a template of static content (HTML tables, CSS, etc.). This template itself doesn’t contain any data.

  • However, in the process of opening the URL, your browser sends a request to the host server.

  • If your request is valid, then the server issues a response that fetches the necessary data for you and renders the page dynamically in your browser.

  • The page that you actually see in your browser is thus a mix of static content and dynamic information that is rendered by your browser (i.e. the “client”).

All of this requesting, responding and rendering takes place through the host application’s API (or Application Programming Interface). Time for a student presentation to go over APIs in more depth…

Student presentation: APIs

If you’re new to APIs or reading this after the fact, then I recommend this excellent resource from Zapier: An Introduction to APIs. It’s short, and you don’t need to work through the whole thing to get the gist.

The summary version is that an API is really just a collection of rules and methods that allow different software applications to interact and share information. This includes not just websites and browsers, but also software packages like the R libraries we’ve been using.1

A bit more about API endpoints

A key point in all of this is that, in the case of web APIs, we can access information directly from the API database if we can specify the correct URL(s). These URLs are known as API endpoints.

API endpoints are in many ways similar to the normal website URLs that we’re all used to visiting. For starters, you can navigate to them in your web browser. However, whereas normal websites display information in rich HTML content — pictures, cat videos, nice formatting, etc. — an API endpoint is much less visually appealing. Navigate your browser to an API endpoint and you’ll just see a load of seemingly unformatted text. In truth, what you’re really seeing is (probably) either JSON (JavaScript Object Notation) or XML (Extensible Markup Language).

You don’t need to worry too much about the syntax of JSON and XML. The important thing is that the object in your browser — that load of seemingly unformatted text — is actually very precisely structured and formatted. Moreover, it contains valuable information that we can easily read into R (or Python, Julia, etc.). We just need to know the right API endpoint for the data that we want.

Let’s practice doing this through a few example applications. I’ll start with the simplest case (no API key required, explicit API endpoint) and then work through some more complicated examples.

Application 1: Trees of New York City

NYC Open Data is a pretty amazing initiative. Its mission is to “make the wealth of public data generated by various New York City agencies and other City organizations available for public use”. You can get data on everything from arrest data, to the location of wi-fi hotspots, to city job postings, to homeless population counts, to dog licenses, to a directory of toilets in public parks… The list goes on. I highly encourage you to explore in your own time, but we’re going to do something “earthy” for this first application: Download a sample of tree data from the 2015 NYC Street Tree Census.

I wanted to begin with an example from NYC Open Data, because you don’t need to set up an API key in advance.2 All you need to do is complete the following steps:

  • Open the web page in your browser (if you haven’t already done so).
  • You should immediately see the API tab. Click on it.
  • Copy the API endpoint that appears in the popup box.
  • Optional: Paste that endpoint into a new tab in your browser. You’ll see a bunch of JSON text, which you can render nicely using the JSONView browser extension that we installed earlier.

Here’s a GIF of me completing these steps:

Now that we’ve located the API endpoint, let’s read the data into R. We’ll do so using the fromJSON() function from the excellent jsonlite package. This will automatically coerce the JSON array into a regular R data frame. However, I’ll go that little bit further and convert it into a tibble, since the output is nicer to work with.
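Here’s a sketch of that read-in step. I’m storing the result in an object called nyc_trees (name illustrative). The Socrata resource ID in the endpoint below is also illustrative — substitute the exact endpoint that you copied from the API tab.

```r
library(jsonlite)
library(tidyverse)

## Read the JSON array directly from the API endpoint and coerce to a tibble
nyc_trees <-
  fromJSON("https://data.cityofnewyork.us/resource/uvpi-gqnh.json") %>%
  as_tibble()
nyc_trees
```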

## # A tibble: 1,000 x 45
##    address bbl   bin   block_id boro_ct borocode boroname brch_light
##    <chr>   <chr> <chr> <chr>    <chr>   <chr>    <chr>    <chr>     
##  1 108-00… 4022… 4052… 348711   4073900 4        Queens   No        
##  2 147-07… 4044… 4101… 315986   4097300 4        Queens   No        
##  3 390 MO… 3028… 3338… 218365   3044900 3        Brooklyn No        
##  4 1027 G… 3029… 3338… 217969   3044900 3        Brooklyn No        
##  5 603 6 … 3010… 3025… 223043   3016500 3        Brooklyn No        
##  6 8 COLU… 1011… 1076… 106099   1014500 1        Manhatt… No        
##  7 120 WE… 1011… 1076… 106099   1014500 1        Manhatt… No        
##  8 311 WE… 1010… 1086… 103940   1012700 1        Manhatt… No        
##  9 65 JER… <NA>  <NA>  407443   5006400 5        Staten … No        
## 10 638 AV… 3072… 3320… 207508   3037402 3        Brooklyn No        
## # … with 990 more rows, and 37 more variables: brch_other <chr>,
## #   brch_shoe <chr>, cb_num <chr>, census_tract <chr>, cncldist <chr>,
## #   council_district <chr>, created_at <chr>, curb_loc <chr>,
## #   guards <chr>, health <chr>, latitude <chr>, longitude <chr>,
## #   nta <chr>, nta_name <chr>, problems <chr>, root_grate <chr>,
## #   root_other <chr>, root_stone <chr>, sidewalk <chr>, spc_common <chr>,
## #   spc_latin <chr>, st_assem <chr>, st_senate <chr>, state <chr>,
## #   status <chr>, steward <chr>, stump_diam <chr>, tree_dbh <chr>,
## #   tree_id <chr>, trnk_light <chr>, trnk_other <chr>, trunk_wire <chr>,
## #   user_type <chr>, x_sp <chr>, y_sp <chr>, zip_city <chr>, zipcode <chr>

Aside on limits: Note that the full census dataset contains nearly 700,000 individual trees. However, we only downloaded a tiny sample of that, since the API defaults to a limit of 1,000 rows. I don’t care to access the full dataset here, since I just want to illustrate some basic concepts. Nonetheless, if you were so inclined and read the docs, you’d see that you can override this default by adding ?$limit=LIMIT to the API endpoint. For example, to read in only the first five rows, you could use:
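A sketch of that call (the resource ID below is illustrative — swap in the endpoint that you copied earlier):

```r
library(jsonlite)

## Override the default 1,000-row limit; here we read in only the first five rows
fromJSON("https://data.cityofnewyork.us/resource/uvpi-gqnh.json?$limit=5")
```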

Getting back on track, let’s plot our tree data just to show it worked.
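A sketch of one such plot, assuming the tree data were read into a data frame called nyc_trees (name illustrative). The variable choices here are mine; note that the API returned every column as character, so we need to convert the plotting variables to numeric first.

```r
library(tidyverse)

nyc_trees %>%
  select(longitude, latitude, tree_dbh, boroname) %>%
  mutate(across(c(longitude, latitude, tree_dbh), as.numeric)) %>%
  ggplot(aes(x = longitude, y = latitude, size = tree_dbh)) +
  geom_point(alpha = 0.5) +
  scale_size_continuous(name = "Tree diameter") +
  labs(
    x = "Longitude", y = "Latitude",
    title = "Sample of New York City trees",
    caption = "Source: NYC Open Data"
  )
```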

Not too bad. This would probably be more fun / impressive with an actual map of New York behind it. We’ll save that for the spatial lecture that’s coming up later in the course, though.

Application 2: World rugby rankings

Our second application will involve a more challenging case where the API endpoint is hidden from view. In particular, let’s try to scrape data on World Rugby rankings.3 Start by taking a look at the complicated structure of the website in a live session.

Challenge: Try to scrape the full country rankings using the rvest + CSS selectors approach that we practiced last time.

If you’re anything like me, you would have struggled to scrape the desired information using the rvest + CSS selectors approach. Even if you managed to extract some kind of information, you likely only got a subset of what you wanted. (For example, just the column names, or the first ten rows before the “VIEW MORE RANKINGS” button.) And we haven’t even considered trying to get information from a different date.4

Locating the hidden API endpoint

Fortunately, there’s a better way: Access the full database of rankings through the API. First we have to find the endpoint, though. Start by inspecting the page like we practised last lecture. (Ctrl+Shift+I in Chrome. Ctrl+Shift+Q in Firefox.) This time, however, rather than mousing over individual objects on the page, head to the Network tab at the top of the inspect element panel. Then click on XHR.5




Refresh the page (Ctrl+R). You should see something like the below, which captures all traffic coming to and from the page. Our task now is to scroll through these different links and see which one contains the information that we’re after. In this case, that’s the rugby rankings, but we could also look for other information on the page like upcoming matches.




The item that I’ve highlighted brings up a URL called https://cmsapi.pulselive.com/rugby/rankings/mru?language=en&client=pulse.

Hmmm… “API” you say? “Rankings” you say? Sounds promising.

Let’s click on this item and then open up the Preview tab.




Again, this looks good. We can see what looks like the first row of the rankings table (“New Zealand”, etc.). To make sure, go back to the Headers tab, grab that https://cmsapi.pulselive.com/rugby/rankings/mru?language=en&client=pulse, and paste it into our browser. Note that I’m using the JSONView browser plugin for nicely rendered JSON output, so don’t be alarmed if your screen looks different to mine.




Boom. We’ve located our API endpoint. Let’s pull the data into R.

Pulling the data into R

There are several packages for reading web-based (and web API) data into R. The httr package that comes bundled with the tidyverse is a good example, while the curl package is another (very powerful) option. However, for the below I’m just going to use the jsonlite package’s fromJSON() function, which is perfect for reading in JSON objects like our API endpoint.
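Something like the following, where I’m calling the resulting object rugby (name mine):

```r
library(jsonlite)

endpoint <- "https://cmsapi.pulselive.com/rugby/rankings/mru?language=en&client=pulse"
rugby <- fromJSON(endpoint)
str(rugby)  ## inspect the structure of what we got back
```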

## List of 3
##  $ label    : chr "Mens Rugby Union"
##  $ entries  :'data.frame':   105 obs. of  6 variables:
##   ..$ team       :'data.frame':  105 obs. of  5 variables:
##   .. ..$ id          : int [1:105] 37 36 33 34 39 38 35 46 42 40 ...
##   .. ..$ altId       : logi [1:105] NA NA NA NA NA NA ...
##   .. ..$ name        : chr [1:105] "New Zealand" "Ireland" "Wales" "England" ...
##   .. ..$ abbreviation: chr [1:105] "NZL" "IRE" "WAL" "ENG" ...
##   .. ..$ annotations : logi [1:105] NA NA NA NA NA NA ...
##   ..$ matches    : int [1:105] 209 182 196 189 200 217 177 111 187 182 ...
##   ..$ pts        : num [1:105] 92.5 91.2 87.2 86.2 84.6 ...
##   ..$ pos        : int [1:105] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ previousPts: num [1:105] 92.5 91.2 87.2 86.2 84.6 ...
##   ..$ previousPos: int [1:105] 1 2 3 4 5 6 7 8 9 10 ...
##  $ effective:List of 3
##   ..$ millis   : num 1.55e+12
##   ..$ gmtOffset: num 0
##   ..$ label    : chr "2019-01-07"

So we have a nested list, where what looks to be the main element of interest, $entries, is itself a list.6 Nested lists are the law of the land when it comes to JSON data. Don’t worry too much about this now, but R is ideally suited to handling this type of nested information. We’ll see more examples later in the course when we start working with spatial data (e.g. geoJSON) and you’ll even find that the nested structure can prove very powerful once you start doing more advanced programming and analysis in R.

Okay, let’s extract the $entries element and have a look at its structure. We could use the str() base function, but for complicated list elements the interactive widget provided by listviewer::jsonedit() is hard to beat. For completeness, I’ll then also preview the $entries$team sub-element.
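In code, assuming the API response is stored in an object called rugby (name illustrative):

```r
library(listviewer)

## Interactive widget for exploring the nested $entries element
## (renders in RStudio's viewer pane or in the knitted HTML document)
jsonedit(rugby$entries, mode = "view")

## Preview the $entries$team sub-element (itself a data frame)
head(rugby$entries$team)
```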

##   id altId         name abbreviation annotations
## 1 37    NA  New Zealand          NZL          NA
## 2 36    NA      Ireland          IRE          NA
## 3 33    NA        Wales          WAL          NA
## 4 34    NA      England          ENG          NA
## 5 39    NA South Africa          RSA          NA
## 6 38    NA    Australia          AUS          NA

It looks like we can just bind the $entries$team data frame directly to the other elements of the parent $entries “data frame” (actually: “list”). Let’s do that and then clean things up a bit. I’m going to call the resulting data frame rankings.
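A sketch of that binding-and-cleaning step, again assuming the response object is called rugby. The renames simply convert the camelCase columns to snake_case, to match the printed output below.

```r
library(dplyr)

rankings <-
  bind_cols(
    rugby$entries$team,                            ## team names, abbreviations, etc.
    rugby$entries %>% select(matches:previousPos)  ## matches, points, positions
  ) %>%
  select(
    pos, pts, name, abbreviation, matches,
    previous_pts = previousPts, previous_pos = previousPos
  ) %>%
  as_tibble()
rankings
```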

## # A tibble: 105 x 7
##      pos   pts name         abbreviation matches previous_pts previous_pos
##    <int> <dbl> <chr>        <chr>          <int>        <dbl>        <int>
##  1     1  92.5 New Zealand  NZL              209         92.5            1
##  2     2  91.2 Ireland      IRE              182         91.2            2
##  3     3  87.2 Wales        WAL              196         87.2            3
##  4     4  86.2 England      ENG              189         86.2            4
##  5     5  84.6 South Africa RSA              200         84.6            5
##  6     6  82.4 Australia    AUS              217         82.4            6
##  7     7  81.8 Scotland     SCO              177         81.8            7
##  8     8  77.9 Fiji         FJI              111         77.9            8
##  9     9  77.3 France       FRA              187         77.3            9
## 10    10  77.0 Argentina    ARG              182         77.0           10
## # … with 95 more rows

BONUS: Get and plot the rankings history

The above looks great, except for the fact that it’s just a single snapshot of the most recent rankings. We are probably more interested in looking back at changes in the ratings over time. For example, back to an era when South Africa wasn’t so kak.

How do we do this? Well, in the spirit of art-vs-science, let’s open up the Inspect window of the rankings page again and start exploring. What happens if we click on the calendar element, say, change the month to “April”?

This looks promising! Essentially, the same API endpoint that we saw previously, but now appended with a date, https://cmsapi.pulselive.com/rugby/rankings/mru?date=2018-05-01&client=pulse. If you were to continue along in this manner – clicking on the website calendar and looking for XHR traffic – you would soon realise that these date suffixes follow a predictable pattern: They are spaced a week apart and always fall on a Monday. In other words, World Rugby updates its rankings table weekly and publishes the results on Mondays.

We now have enough information to write a function that will loop over a set of dates and pull data from the relevant API endpoint. NB: I know we haven’t gotten to the programming section of the course, so don’t worry about the specifics of the next few code chunks. I’ll try to comment my code quite explicitly, but I mostly want you to focus on the big picture.

To start, we need a vector of valid dates to loop over. I’m going to use various functions from the lubridate package to help with this. Note that I’m only going to extract a few data points — one observation a year for the last decade or so — since I just want to demonstrate the principle. No need to hammer the host server. (More on that below.)
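Here’s one way to construct such a vector, reproducing the printed output below. The floor_date() call rounds each Jan 1st down to the Monday on or immediately before it.

```r
library(lubridate)

## One date per year, from 2004 through 2019
dates <- seq(ymd("2004-01-01"), ymd("2019-01-01"), by = "years")
## Rankings are published on Mondays, so round each date down to the
## nearest Monday (week_start = 1 means weeks start on Monday)
dates <- floor_date(dates, "week", week_start = 1)
dates
```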

## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
##  [1] "2003-12-29" "2004-12-27" "2005-12-26" "2007-01-01" "2007-12-31"
##  [6] "2008-12-29" "2009-12-28" "2010-12-27" "2011-12-26" "2012-12-31"
## [11] "2013-12-30" "2014-12-29" "2015-12-28" "2016-12-26" "2018-01-01"
## [16] "2018-12-31"

Next, I’ll write out a function that I’ll call rugby_scrape. This function will take a single argument; namely a date that it will use to construct a new API endpoint during each iteration. Beyond that, it will do pretty much exactly the same things that we did in our previous, manual data scrape. The only other difference is that it will wait three seconds after running (i.e. Sys.sleep(3)). I’m adding this final line to avoid hammering the server with instantaneous requests when we put everything into a loop.
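A sketch of what rugby_scrape might look like. The cleaning steps mirror the manual scrape from earlier, and I’m adding a date column up front so that we can distinguish the different scrapes once they’re combined.

```r
library(jsonlite)
library(dplyr)

rugby_scrape <- function(x) {
  ## Construct the date-specific API endpoint
  endpoint <- paste0(
    "https://cmsapi.pulselive.com/rugby/rankings/mru?date=", x, "&client=pulse"
  )
  rugby <- fromJSON(endpoint)
  ## Same binding and cleaning steps as the manual scrape
  rankings <-
    bind_cols(
      rugby$entries$team,
      rugby$entries %>% select(matches:previousPos)
    ) %>%
    select(
      pos, pts, name, abbreviation, matches,
      previous_pts = previousPts, previous_pos = previousPos
    ) %>%
    mutate(date = x) %>%          ## record the date of this scrape
    select(date, everything()) %>%
    as_tibble()
  Sys.sleep(3)  ## wait three seconds before returning, to avoid hammering the server
  return(rankings)
}
```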

Finally, we can now iterate (i.e. loop) over our dates vector, by plugging the values sequentially into our rugby_scrape function. There are a variety of ways to iterate in R, but I’m going to use an lapply() call below.7 We’ll then bind everything into a single data frame using dplyr::bind_rows() and name the resulting object rankings_history.
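In code, assuming the dates vector and rugby_scrape() function described above:

```r
library(dplyr)

rankings_history <-
  lapply(dates, rugby_scrape) %>%  ## run rugby_scrape() on each date in turn
  bind_rows()                      ## bind the resulting list of data frames into one
rankings_history
```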

## # A tibble: 1,569 x 8
##    date         pos   pts name  abbreviation matches previous_pts
##    <date>     <int> <dbl> <chr> <chr>          <int>        <dbl>
##  1 2003-12-29     1  94.0 Engl… ENG               17         92.1
##  2 2003-12-29     2  90.1 New … NZL               17         88.2
##  3 2003-12-29     3  86.6 Aust… AUS               17         88.4
##  4 2003-12-29     4  82.7 Fran… FRA               17         84.7
##  5 2003-12-29     5  81.2 Sout… RSA               15         81.2
##  6 2003-12-29     6  80.5 Irel… IRE               15         80.5
##  7 2003-12-29     7  78.0 Arge… ARG               14         78.0
##  8 2003-12-29     8  76.9 Wales WAL               15         76.9
##  9 2003-12-29     9  76.4 Scot… SCO               15         76.4
## 10 2003-12-29    10  73.5 Samoa SAM               14         73.5
## # … with 1,559 more rows, and 1 more variable: previous_pos <int>

Let’s review what we just did:

  • We created a vector of dates — creatively called dates — with observations evenly spaced (about) a year apart, falling on the Monday on or immediately before Jan 1st of that year.
  • We then iterated (i.e. looped) over these dates using a function, rugby_scrape, which downloaded and cleaned data from the relevant API endpoint.
  • At the end of each iteration, we told R to wait a few seconds before executing the next step. The reason is that R can execute these steps much, much quicker than we could ever type them manually. It probably doesn’t matter for this example, but you can easily “overwhelm” a host server by hammering it with a loop of automated requests. (Or, just as likely: They have safeguards against this type of behaviour and will start denying your requests as a suspected malicious attack.) The “be nice” motto is important to remember when scraping API data.
  • Note that each run of our iteration will have produced a separate data frame, which lapply() by default appends into a list. We used dplyr::bind_rows() to bind these separate data frames into a single data frame.

Okay! Let’s plot the data and highlight a select few countries in the process.
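A sketch of one possible plot. The highlighted teams and their colours below are my own choices.

```r
library(dplyr)
library(ggplot2)

teams <- c("NZL", "RSA", "ENG", "JPN")  ## teams to highlight (my pick)
team_cols <- c("NZL" = "black", "RSA" = "#036B3F", "ENG" = "#8F1F19", "JPN" = "#BD0029")

rankings_history %>%
  ggplot(aes(x = date, y = pts, group = name)) +
  geom_line(col = "grey") +  ## plot every team in grey...
  geom_line(                 ## ...then overlay the highlighted teams in colour
    data = rankings_history %>% filter(abbreviation %in% teams),
    aes(col = abbreviation), lwd = 1
  ) +
  scale_color_manual(values = team_cols) +
  labs(
    x = "Date", y = "Points",
    title = "International rugby rankings",
    caption = "Source: World Rugby"
  )
```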

Further resources and exercises

  • Tyler Clavelle has written several cool blog posts on interacting with APIs through R. I especially recommend going over — and replicating — his excellent tutorial on the GitHub API.

  • Greg Reda’s “Web Scraping 201: finding the API” covers much of the same ground as we have here. While he focuses on Python tools, I’ve found it to be a handy reference over the years. (You can also take a look at the earlier posts in Greg’s webscraping series — Part 1 and Part 2 — to see some Python equivalents of the rvest tools that we’ve been using.)

  • Ian London (another Python user) has a nice blog post on “Discovering hidden APIs” from Airbnb.


  1. Fun fact: A number of R packages that we’ll be using later in this course (e.g. leaflet, plotly, etc.) are really just a set of wrapper functions that interact with the underlying APIs and convert your R code into some other language (e.g. JavaScript).

  2. Truth be told: To avoid rate limits — i.e. throttling the number of requests that you can make per hour — it’s best to sign up for an NYC Open Data app token. We’re only going to make one or two requests here, though, so we should be fine.

  3. Because what’s more important than teaching Americans about rugby?

  4. Note that the URL doesn’t change even when we select a different date on the calendar.

  5. XHR stands for XMLHttpRequest and is the type of request used to fetch XML or JSON data.

  6. I know that R says $entries is a data.frame, but we can tell from the str() call that it follows a list structure. In particular, the $entries$team sub-element is itself a data frame. Remember: R is very flexible and allows data frames to be nested within other data frames (and lists).

  7. Again, don’t worry too much about this now. We’ll cover iteration and programming in more depth in a later lecture.